Large language model scoring of medical student reflection essays: Accuracy and reproducibility of prompt-model variations
This study shows that large language models can score medical student reflection essays with near-perfect accuracy and reproducibility. Fine-tuned models and prompts that include exemplars perform best, and the most cost-effective approach depends on the number of essays to be scored.